Method and device for generating an adapted reference for
automatic speech recognition

BACKGROUND OF THE INVENTION 

1. Technical Field

The invention relates to the field of automatic speech recogni- 
tion and more particularly to generating a reference adapted to 
an individual speaker. 

2. Discussion of the Prior Art

During automatic speech recognition a spoken utterance is 
analyzed and compared with one or more already existing refer- 
ences. If the spoken utterance matches an existing reference, a 
corresponding recognition result is output. The recognition 
result can e. g. be a pointer which identifies the existing 
reference which is matched by the spoken utterance. 

The references used for automatic speech recognition can be 
both speaker dependent and speaker independent. Speaker inde- 
pendent references can e. g. be created by averaging utterances 
of a large number of different speakers in a training process. 
Speaker dependent references for an individual speaker, i. e., 
references which are personalized in accordance with an indi- 
vidual speaker's speaking habit, can be obtained by means of an 
individual training process. In order to keep the effort for 
the training of speaker dependent references low, it is prefer- 
able to use a single word spoken in isolation for each speaker 
dependent reference to be trained. The fact that the training 
utterances are spoken in isolation leads to problems for con- 
nected word recognition because fluently spoken utterances 
differ from utterances spoken in isolation due to coarticula- 
tion effects. These coarticulation effects deteriorate the 
accuracy of automatic speech recognition if speaker dependent
references which were trained in isolation are used for
recognition of connected words. Moreover, even if connected
words have been trained, a user's voice may change, e.g. due to
different health conditions, which also deteriorates the
accuracy of automatic speech recognition which is based on
speaker dependent references. The accuracy of automatic speech
recognition is generally even lower if speaker independent
references are used, especially when the utterances are spoken
in a heavy dialect or with a foreign accent. The accuracy of
automatic speech recognition is also influenced by the speaker's
acoustic environment, e.g. the presence of background noise or
the use of a so-called hands free set.

In order to improve the recognition results of automatic speech
recognition, speaker adaptation is used. Speaker adaptation
makes it possible to incorporate individual speaker
characteristics in both speaker dependent and speaker
independent references. A method and a device for continuously
updating existing references are known from WO 95/09416. The
method and the device described in WO 95/09416 make it possible
to adapt existing references to changes in a speaker's voice and
to changing background noise. An adaptation of an existing
reference in accordance with a spoken utterance takes place each
time a recognition result which corresponds to an existing
reference is obtained, i.e., each time a spoken utterance is
recognized.

It has been found that speaker adaptation of existing references
generally improves the accuracy of automatic speech recognition.
However, the accuracy of automatic speech recognition using
continuously adapted references generally shows fluctuations.
This means that the recognition accuracy does not continuously
improve with each adaptation process. To the contrary, the
recognition accuracy may also temporarily decrease.

There is, therefore, a need for a method and a device for
generating an adapted reference for automatic speech recognition
which is less prone to a deterioration of the recognition
accuracy.

SUMMARY OF THE INVENTION

The present invention satisfies this need by providing a method
for generating a reference adapted to an individual speaker or
a speaker's acoustic environment which comprises performing
automatic speech recognition based on a spoken utterance and
obtaining a recognition result which corresponds to a currently
valid reference, adapting the currently valid reference in
accordance with the utterance, assessing the adapted reference
and deciding if the adapted reference is used for further
recognition. A device for generating a speaker adapted reference
which satisfies this need comprises a speech recognizer for
performing recognition based on a spoken utterance and for
obtaining a recognition result which corresponds to a currently
valid reference, an adaptation unit which adapts the currently
valid reference in accordance with the utterance and an
assessing unit which assesses the adapted reference and decides
if the adapted reference is used for further recognition.

According to the invention, an assessment step is conducted
after the currently valid speaker dependent or speaker
independent reference is adapted. Then, based on the result of
the assessment, it is decided whether or not the adapted
reference is used for further recognition. It can thus be
avoided that a reference which is adapted in the wrong direction
will be used for further recognition. A deterioration of the
recognition accuracy would take place e.g. if the recognition
result does not match the spoken utterance and the adaptation
process is carried out without an assessment step. Consequently,
according to the invention, corrupted adapted references can be
rejected and e.g. permanent storing of corrupted adapted
references can be aborted. This is preferably done prior to
recognizing the next spoken utterance.

According to a first aspect of the invention, the adapted
reference is assessed by determining a distance between the
adapted reference and the currently valid reference. The
determined distance can then be used as input for the decision
if the adapted reference is used for further recognition.
Moreover, when assessing the adapted reference, the distances
between the adapted reference and all other currently valid
references which do not correspond to the recognition result
may additionally be taken into account. This makes it possible
to base the decision if the adapted reference is used for
further recognition on the question which of all currently valid
references is closest to the adapted reference. Thus, if one of
the currently valid references which does not correspond to the
recognition result has a smaller distance from the adapted
reference than the currently valid reference which corresponds
to the recognition result, the adapted reference can be
discarded.

According to a second aspect of the invention, the assessment
of the adapted reference is based on an evaluation of the user
behaviour. The recognition result will most likely be wrong if
e.g. a specific action automatically initiated upon obtaining
the recognition result is immediately cancelled by the user or
if the user refuses to confirm the recognition result. In this
case it can automatically be decided that the adapted reference
is to be discarded and not to be used for further automatic
speech recognition. An evaluation of the user behaviour can
also be carried out prior to adapting a currently valid
reference in accordance with a user utterance. The adaptation
process is thus not initiated if the user behaviour indicates
that the recognition result is wrong. In other words, the
adaptation only takes place if the user behaviour indicates that
the recognition result is correct.
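
By way of illustration only, the gating of the adaptation by user
behaviour could be sketched as follows. The names `UserFeedback`,
`result_looks_correct` and `maybe_adapt` are hypothetical and do
not appear in the specification, and the two boolean flags are
merely one assumed way of encoding the user behaviour.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class UserFeedback:
    confirmed: bool  # the user confirmed the recognition result
    cancelled: bool  # the user cancelled the action started from it


def result_looks_correct(feedback: UserFeedback) -> bool:
    # A cancelled action or a refused confirmation suggests a wrong result.
    return feedback.confirmed and not feedback.cancelled


def maybe_adapt(
    reference: List[float],
    utterance: List[float],
    feedback: UserFeedback,
    adapt: Callable[[List[float], List[float]], List[float]],
) -> List[float]:
    # Adapt only if the user behaviour indicates a correct result;
    # otherwise the currently valid reference is returned unchanged.
    if not result_looks_correct(feedback):
        return reference
    return adapt(reference, utterance)
```

In this sketch the adaptation rule itself is passed in as `adapt`,
so the gating logic stays independent of how a reference is
adapted.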

According to a third aspect of the invention, the adapted
reference is assessed both by evaluating the user behaviour and
by determining a distance between the adapted reference and one
or more currently valid references.

The decision if the adapted reference is to be used for further
recognition may be a "hard" decision or a "soft" decision. The
"hard" decision may be based on the question whether or not a
certain threshold of a specific parameter obtained by assessing
the adapted reference has been exceeded. On the other hand, a
"soft" decision may be made by means of e.g. a neural network
which takes into account a plurality of parameters like a
specific user behaviour and a distance between an adapted
reference and one or more currently valid references.
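
The difference between the two kinds of decision could, purely as
a sketch, look as follows. The single logistic unit merely stands
in for a neural network, and its weights, bias and cutoff are
invented, untrained values chosen for illustration; none of these
names or numbers is taken from the specification.

```python
import math


def hard_decision(distance: float, threshold: float) -> bool:
    # Accept the adapted reference only if the threshold is not exceeded.
    return distance <= threshold


def soft_decision(
    distance: float,
    user_confirmed: bool,
    w_distance: float = -1.5,  # assumed weight: larger distance lowers the score
    w_confirmed: float = 2.0,  # assumed weight: confirmation raises the score
    bias: float = 0.5,
    cutoff: float = 0.5,
) -> bool:
    # Combine several parameters into one score and squash it to (0, 1).
    score = w_distance * distance + w_confirmed * float(user_confirmed) + bias
    probability = 1.0 / (1.0 + math.exp(-score))
    return probability >= cutoff
```

With these assumed weights, a small distance combined with a user
confirmation is accepted, while a large distance without
confirmation is rejected.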

In case the decision is based on the distances between the
adapted reference and one or more existing references, the
parameters relevant for assessing an adapted reference are
preferably obtained by analyzing a histogram of previously
determined distances. The distances between the adapted
reference and one or more existing references are preferably
calculated with dynamic programming.

If it is decided that the adapted reference is used for further
recognition, the adapted reference can be stored. With regard to
storing the adapted reference, several strategies can be
applied. According to the simplest embodiment, an adapted
reference is simply substituted for the corresponding previously
valid reference. According to a further embodiment, a set of
adapted references is created which is used in addition to a
set of currently valid references. Thus, the currently valid
references constitute a fallback position in case the adapted
references are e.g. developed in a wrong direction. A further
fallback position can be created by additionally and permanently
storing a set of so-called mother references which constitute a
set of initially created references.

Preferably, a currently valid reference is not adapted
automatically after a recognition result is obtained but only
after confirmation of the recognition result. The nature of the
confirmation depends on the purpose of the automatic speech
recognition. If the automatic speech recognition is e.g.
employed for addressing an entry in a telephone book of a mobile
telephone, the confirmation can be a setting up of a call based
on the recognized entry.

030650-074

The device for generating an adapted reference according to the
invention can comprise at least two separate storing means for
storing two separate sets of references. The provision of a
plurality of storing means makes it possible to delay the
decision whether or not to store an adapted reference
permanently. For each utterance recognizable by the means for
performing the recognition a pair of references can be stored
such that a first of the two references is stored in the first
storing means and a second of the two references in the second
storing means. Also, third storing means can be provided for
storing mother references.

In connection with the plurality of storing means a selection
unit can be employed which sets pointers that make it possible
to determine all references currently valid for recognition of
spoken utterances. According to a first embodiment, a pointer is
set to that reference of each pair of references which is
currently valid, i.e. which constitutes a reference to be used
for automatic speech recognition. Upon generating a newly
adapted reference, the reference of a pair of references to
which the pointer is not set may then be overwritten by the
newly adapted reference. According to a second embodiment, a
pointer is set to the first storing means containing a currently
valid set of references. Prior to or after an adapted reference
is created, the content of the first storing means is copied
into the second storing means. Then, the adapted reference is
stored in the second storing means such that a corresponding
reference in the second storing means is overwritten by the
adapted reference. If, after an assessment step, it is decided
to use the adapted reference for further recognition, the
pointer is shifted from the first storing means to the second
storing means. Otherwise, if it is decided to discard the
adapted reference, the pointer is not shifted.
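
The variant in which a single pointer selects one of two storing
means could be sketched as follows. The class `TwoBankStore` and
its method names are hypothetical. As a further assumption not
stated above, the roles of the two storing means are swapped after
each accepted adaptation, so that the scheme can be repeated.

```python
from typing import Dict, List


class TwoBankStore:
    def __init__(self, references: Dict[str, List[float]]):
        # banks[0] and banks[1] play the role of the two storing means.
        self.banks = [dict(references), {}]
        self.valid = 0  # pointer to the bank holding the currently valid set

    def current(self) -> Dict[str, List[float]]:
        return self.banks[self.valid]

    def stage(self, key: str, adapted: List[float]) -> None:
        # Copy the valid set into the other bank, then overwrite one entry.
        spare = 1 - self.valid
        self.banks[spare] = dict(self.banks[self.valid])
        self.banks[spare][key] = adapted

    def decide(self, accept: bool) -> None:
        # Shift the pointer only if the adapted reference is accepted.
        if accept:
            self.valid = 1 - self.valid
```

Until `decide(True)` is called, recognition would continue to use
the unchanged set returned by `current()`, so a rejected
adaptation leaves no trace in the valid set.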






A further aspect of the invention relates to a computer program 
product with program code means for performing the generation 
of an adapted reference for automatic speech recognition when 
the computer program product is executed in a computing unit. 
Preferably, the computer program product is stored on a com- 
puter-readable recording medium. 

BRIEF DESCRIPTION OF THE DRAWINGS 

Further aspects and advantages of the invention will become 
apparent upon reading the following detailed description of a 
preferred embodiment of the invention and upon reference to the 
drawings in which: 

Fig. 1 is a schematic diagram of a device for generat- 
ing a speaker adapted reference according to the 
invention; 

Fig. 2 is a flow chart depicting a method for generat- 
ing a speaker adapted reference according to the 
invention; and 

Fig. 3 shows histograms of distances between references 
for correct and erroneous recognition results. 

DESCRIPTION OF A PREFERRED EMBODIMENT 

In Fig. 1, a schematic diagram of an embodiment of a device 100
according to the invention for generating a speaker adapted
reference for automatic speech recognition is illustrated. The
device 100 depicted in Fig. 1 can e.g. be a mobile telephone
with a voice interface that allows a telephone book entry to be
addressed by uttering a person's proper name.

The device 100 comprises recognition means 110 in the form of a
speech recognizer for performing recognition based on a spoken
utterance and for obtaining a recognition result which
corresponds to an already existing and currently valid
reference. The recognition means 110 communicate with first 120
and second 130 storing means and with adaptation means 140 for
adapting an existing reference in accordance with the spoken
utterance. The first 120 and second 130 storing means constitute
the vocabulary of the recognition means 110. A pair of
references is stored for each item of the vocabulary such that a
first reference of each pair of references is stored in the
first storing means 120 and a second reference of each pair of
references is stored in the second storing means 130. However,
only one of the two references which constitute a pair of
references is defined as currently valid and can be used by the
recognition means 110. A first pointer (*) is set to the
currently valid reference of each pair of references.

The adaptation means 140 communicate with the first 120 and
second 130 storing means as well as with assessing means 150.
The assessing means 150 assess the adapted reference and decide
if the adapted reference is used for further recognition. The
assessing means 150 communicate with pointing means 160 in the
form of a selection unit which communicates with the first 120
and second 130 storing means.

The device 100 depicted in Fig. 1 further comprises third
storing means 170 for storing a mother reference for each item
of the vocabulary of the recognition means 110. The mother
references stored in the third storing means 170 are initially
created speaker dependent or speaker independent references.
The mother references can be used either in parallel to the
references stored in the first 120 and the second 130 storing
means or as a fallback position.

Referring now to the flow chart of Fig. 2, the function of the 
device 100 is described in more detail. 

Upon receipt of a signal corresponding to a spoken utterance,
the recognition means 110 perform a recognition process. The
signal can be provided e.g. from a microphone or a different
signal source. The recognition process comprises matching a
pattern like one or more feature vectors of the spoken utterance
with corresponding patterns of all currently valid references
stored in either the first 120 or the second 130 storing means,
i.e., all references to which the first pointers (*) are set.
The first pointers (*) were set by the pointing means 160 as
will be described below in more detail. If the pattern of the
spoken utterance matches a pattern of a currently valid
reference, a recognition result which corresponds to the
matching reference is obtained by the recognition means 110. The
recognition result is formed by a second pointer (>) which
points to the pair of references which contains the currently
valid reference matched by the spoken utterance.

After the recognition result is obtained, the telephone book
entry, e.g. a person's proper name, which corresponds to the
recognition result may be acoustically or optically output via
an output unit not depicted in Fig. 1 to a user for
confirmation. After the user confirms that the output is
correct, e.g. by setting up the call to the person corresponding
to the telephone book entry, the recognition result is output to
the adaptation means 140.

The adaptation means 140 load the valid reference of the pair
of references to which the second pointer (>) is set. The
adaptation means 140 then adapt the loaded reference in
accordance with the spoken utterance. This can e.g. be done by
shifting the feature vectors of the loaded reference slightly
towards the feature vectors of the spoken utterance. Thus, an
adapted reference is generated based on both the loaded
reference and the spoken utterance. After the adapted reference
has been generated, the adaptation means 140 store the adapted
reference by overwriting the non-valid reference of the pair of
references to which the second pointer (>) is set.
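
One conceivable way of shifting feature vectors "slightly
towards" those of the utterance is a linear interpolation with a
small step size, sketched below. The function name, the step size
`rate` and the requirement that reference and utterance have
already been time-aligned to the same number of frames are
assumptions made for this illustration and are not specified
above.

```python
from typing import List

Frames = List[List[float]]  # one feature vector per frame


def adapt_reference(reference: Frames, utterance: Frames, rate: float = 0.1) -> Frames:
    # Assumes both sequences are already aligned frame by frame.
    if len(reference) != len(utterance):
        raise ValueError("reference and utterance must have equal frame counts")
    # Move every component a fraction `rate` of the way towards the utterance.
    return [
        [r + rate * (u - r) for r, u in zip(ref_frame, utt_frame)]
        for ref_frame, utt_frame in zip(reference, utterance)
    ]
```

A small `rate` keeps a single utterance from dominating the
reference; the function returns a new list, so the loaded
reference itself remains unchanged.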

After the adapted reference has been stored, the assessing
means 150 assess the adapted reference as will be described
below. The assessing is necessary since the user confirmation
might have been wrong. If the user has e.g. erroneously
confirmed a wrong recognition result, the adapted reference
would be corrupted. The adapted reference has therefore to be
assessed. Based on the result of the assessment, the assessing
means 150 decide if the adapted reference is used for further
recognition.

If it is decided to use the adapted reference for further
recognition, the first pointer (*) is shifted within the current
pair of references to the adapted reference by the pointing
means 160. Consequently, the newly adapted reference to which
the first pointer (*) has been set will constitute the valid
reference in terms of recognizing a subsequent spoken utterance.
The adapted reference is thus stored permanently. On the other
hand, if it is decided to reject the adapted reference, the
position of the first pointer (*) is not changed within the pair
of references to which the second pointer (>) is set.
Consequently, the adapted reference may be overwritten in a
subsequent recognition step.
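
The handling of a pair of references together with its first
pointer (*) could be sketched as follows. The class
`ReferencePair` and its method names are hypothetical, and a
simple index into a two-element list is assumed to play the role
of the pointer.

```python
from dataclasses import dataclass
from typing import List

Reference = List[List[float]]  # sequence of feature vectors


@dataclass
class ReferencePair:
    slots: List[Reference]  # exactly two references per vocabulary item
    valid: int = 0          # plays the role of the first pointer (*)

    def valid_reference(self) -> Reference:
        return self.slots[self.valid]

    def store_adapted(self, adapted: Reference) -> None:
        # Overwrite the non-valid slot; the valid reference stays untouched.
        self.slots[1 - self.valid] = adapted

    def accept(self) -> None:
        # Shift the pointer so that the adapted reference becomes valid.
        self.valid = 1 - self.valid
```

Rejecting an adapted reference requires no action in this sketch:
the pointer stays where it is and the non-valid slot is simply
overwritten by the next call to `store_adapted`.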

In the following, the function of the assessing means 150 is
described by means of two exemplary embodiments.

According to a first embodiment, the assessing means 150
determine the distance between the newly adapted reference and
the currently valid reference of the pair of references to which
the second pointer (>) is set. The distance is calculated by
means of dynamic programming (dynamic time warping) as described
in detail in "The Use of the One-Stage Dynamic Programming
Algorithm for Connected Word Recognition", IEEE Transactions on
Acoustics, Speech, and Signal Processing, Volume ASSP-32, No. 2,
pp. 263 to 271, 1984, herewith incorporated by reference.
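
For orientation, a textbook dynamic time warping distance between
two sequences of feature vectors is sketched below. It is a basic
variant with a Euclidean local distance and the three standard
step directions, not the one-stage algorithm of the cited paper,
and the function name `dtw_distance` is chosen for this
illustration only.

```python
import math
from typing import List

Frames = List[List[float]]  # one feature vector per frame


def _euclidean(p: List[float], q: List[float]) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))


def dtw_distance(a: Frames, b: Frames) -> float:
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j]: cheapest alignment of the first i frames of a
    # with the first j frames of b.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = _euclidean(a[i - 1], b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # frame of a repeated against b
                cost[i][j - 1],      # frame of b repeated against a
                cost[i - 1][j - 1],  # both sequences advance
            )
    return cost[n][m]
```

Because the alignment may repeat frames, two references that
differ only in speaking rate obtain a small distance, which is why
such a measure suits the comparison of references of different
length.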

After the distance has been calculated, the assessing means 150
decide whether or not to use the adapted reference for further
recognition, i.e., whether or not to shift the corresponding
first pointer (*), based on a distance threshold. This means
that the adapted reference is only used for further recognition
if the distance between the adapted reference and the
corresponding currently valid reference does not exceed this
threshold.

The distance threshold can be obtained by analyzing a histogram
of previously determined distances corresponding to real life
data as depicted in Fig. 3. Fig. 3 shows two different
histograms 300, 310. The first histogram 300 reflects the
distances between adapted references and existing references for
correct recognition results and the other histogram 310 for
erroneous recognition results. From Fig. 3 it becomes clear that
the average distance for correct recognition results is smaller
than the average distance for erroneous recognition results. In
order to create the histograms depicted in Fig. 3, a large
amount of data has to be analyzed.

The above mentioned distance threshold for deciding whether an
adapted reference is used for further recognition is chosen
e.g. at the crossing or in the vicinity of the crossing of the
two histograms 300, 310 (as indicated by the line 320).
Consequently, an adapted reference is used for further
recognition if the calculated distance does not exceed the
distance corresponding to the line 320.
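
How a threshold might be read off at the crossing of two
histograms is sketched below. The function name, the assumption
that both histograms share the same bin edges and the rule of
searching upwards from the peak of the histogram for correct
results are choices made for this illustration; the counts in the
usage note are invented and are not data from Fig. 3.

```python
from typing import Sequence


def crossing_threshold(
    bin_edges: Sequence[float],
    correct_counts: Sequence[int],
    error_counts: Sequence[int],
) -> float:
    # bin_edges has one more entry than each list of counts.
    n = len(correct_counts)
    # Start at the most populated bin of the "correct" histogram.
    peak = max(range(n), key=lambda i: correct_counts[i])
    for i in range(peak, n):
        # First bin in which erroneous results are at least as frequent.
        if error_counts[i] >= correct_counts[i]:
            return bin_edges[i]
    # No crossing found: fall back to the upper end of the histogram.
    return bin_edges[-1]
```

For example, with bin edges `[0, 1, 2, 3, 4, 5]`, counts
`[2, 10, 6, 1, 0]` for correct results and `[0, 1, 3, 7, 9]` for
erroneous results, the sketch places the threshold at the lower
edge of the fourth bin.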

If only the distance between the adapted reference and the
corresponding currently valid reference of the pair of
references to which the second pointer (>) is set is determined,
only the shifting between these two references is taken into
account. According to a second exemplary embodiment for
assessing the adapted reference, the distances between the
adapted reference and all other currently valid references to
which first pointers (*) are set are additionally calculated.
Consequently, the shifting of the adapted reference with respect
to all currently valid references constituting the vocabulary of
the recognition means 110 is taken into account. The distances
can again be calculated by means of dynamic programming.






After the distances between the adapted reference and all other
currently valid references have been calculated, the currently
valid reference having the smallest distance from the adapted
reference is determined. Should the currently valid reference
having the smallest distance from the adapted reference not
correspond to the pair of references to which the second
pointer (>) is set, it is decided that the adapted reference is
not used for further recognition. Otherwise, should the
currently valid reference with the smallest distance from the
adapted reference belong to the pair of references to which the
second pointer (>) is set, the adapted reference is used for
further recognition.
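
The nearest-reference test of the second embodiment could be
sketched as follows. The function name `accept_adapted` and the
use of a dictionary keyed by vocabulary item are assumptions; the
distance measure is passed in as a parameter so that any measure,
e.g. one based on dynamic programming, can be plugged in.

```python
from typing import Callable, Dict, TypeVar

R = TypeVar("R")  # type of a reference, left open in this sketch


def accept_adapted(
    adapted: R,
    valid_references: Dict[str, R],
    recognized_key: str,
    distance: Callable[[R, R], float],
) -> bool:
    # Find the currently valid reference closest to the adapted reference.
    nearest = min(
        valid_references,
        key=lambda key: distance(adapted, valid_references[key]),
    )
    # Accept only if that reference belongs to the recognized item.
    return nearest == recognized_key
```

In case of equal distances, `min` returns the first key
encountered, so a production system would presumably need an
explicit rule for ties.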

It will be appreciated by those of ordinary skill in the art
that this invention can be embodied in other specific forms
without departing from its essential character. The embodiments
described above should therefore be considered in all respects
to be illustrative and not restrictive. The scope of the
invention is determined solely by the following claims, and all
modifications that fall within that scope are intended to be
included therein.



